Additional Tables and Analyses
This page contains additional tables and statistical analyses that complement the main results in the paper.
Term Lists
To build a comprehensive thesaurus of accounting terminology, we construct two alternative lists of accounting terms and concepts: one following a top-down approach and one following a bottom-up approach.
Construction Methodology
Top-Down Approach: We collect accounting terms from multiple authoritative sources—IFRS, US GAAP, UK GAAP standards, and specialized accounting dictionaries. Terms explicitly classified as synonyms are grouped by their underlying accounting concepts. The lists are refined using a GPT-based procedure and manually validated to ensure they reflect terminology actually used in practice across our global corpus.
Bottom-Up Approach: We use XBRL filings from EDGAR, extracting the different terms firms use to describe the same line item and mapping them to underlying accounting concepts via taxonomy tags. To reduce noise, we impose a frequency threshold (a term must appear in at least 20 distinct filings) and apply a majority disambiguation rule (a term is removed if it appears in fewer than 5% of the filings for a given concept). This approach ensures we capture variation in real-world reporting practice.
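The two noise filters can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in the paper: the `(filing_id, concept, term)` record layout and the function name `filter_terms` are hypothetical, and the 5% rule is read here as "the share of a concept's filings in which the term labels that concept".

```python
from collections import defaultdict

def filter_terms(records, min_filings=20, min_share=0.05):
    """Apply the frequency threshold and the per-concept share rule.

    records: iterable of (filing_id, concept, term) tuples (hypothetical layout).
    Returns a dict mapping each concept to the set of terms that survive.
    """
    term_filings = defaultdict(set)     # term -> filings that use it at all
    pair_filings = defaultdict(set)     # (concept, term) -> filings using term for concept
    concept_filings = defaultdict(set)  # concept -> filings that report the concept
    for filing_id, concept, term in records:
        term_filings[term].add(filing_id)
        pair_filings[(concept, term)].add(filing_id)
        concept_filings[concept].add(filing_id)

    kept = defaultdict(set)
    for (concept, term), filings in pair_filings.items():
        # Frequency threshold: the term must appear in enough distinct filings.
        if len(term_filings[term]) < min_filings:
            continue
        # Disambiguation (assumed reading): drop terms that label the concept
        # in too small a share of that concept's filings.
        if len(filings) / len(concept_filings[concept]) < min_share:
            continue
        kept[concept].add(term)
    return dict(kept)

# Toy data: "turnover" appears in a single filing and is dropped.
records = (
    [(f"f{i}", "Revenues", "revenue") for i in range(30)]
    + [(f"f{i}", "Revenues", "net sales") for i in range(30, 55)]
    + [("f99", "Revenues", "turnover")]
)
result = filter_terms(records)
```

Keeping both thresholds as parameters makes it easy to check how sensitive the resulting lists are to the cut-offs.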
Key Differences
The bottom-up lists contain more terms overall and more long terms (n-grams of three or more words), reflecting the technical nature of XBRL taxonomies. On average, bottom-up concepts have more synonyms (12.0 vs. 5.4 for top-down), higher textual similarity (0.86 vs. 0.78), and lower concentration (0.54 vs. 0.58). This variation captures not only true synonyms but also different writing conventions used in practice.
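As a rough illustration of two of these comparisons, the sketch below computes the average number of terms per concept and the share of terms with three or more words. The input layout and the function name `summarize` are hypothetical, and counting every listed term as a synonym is an assumption; textual similarity and concentration are not reproduced because their definitions are not given on this page.

```python
def summarize(concept_terms):
    """Return (average terms per concept, share of terms with 3+ words).

    concept_terms: dict mapping a concept to a collection of its terms
    (hypothetical layout; every listed term is counted as a synonym).
    """
    all_terms = [t for terms in concept_terms.values() for t in terms]
    avg_synonyms = len(all_terms) / len(concept_terms)
    # Term length is measured as the number of whitespace-separated words.
    long_share = sum(len(t.split()) >= 3 for t in all_terms) / len(all_terms)
    return avg_synonyms, long_share

# Made-up example lists, not taken from the actual datasets.
example = {
    "Revenues": ["revenue", "net sales", "sales revenue net"],
    "Inventories": ["inventory"],
}
avg_synonyms, long_share = summarize(example)
```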
Download all term lists in a formatted Excel file (with separate sheets for each dataset):
📥 Download Term Lists (Excel, 2.6 MB)
Term List (Data)
Authoritative Sources Approach
This list is constructed from IFRS, US GAAP, UK GAAP standards, and specialized accounting dictionaries. Terms are grouped by accounting concepts and refined through GPT-based processing and manual validation. The list is restricted to terminology observed in our global corpus, ensuring it reflects actual reporting practice.
Key Characteristics:
- Source: Authoritative accounting standards and dictionaries
- Average synonyms per concept: 5.4
- Textual similarity: 0.78
- Concentration: 0.58
XBRL-Based Approach: U.S. Domestic Filers
This list is derived from XBRL filings (Form 10-K) on EDGAR. We extract terms firms use to label the same line items, mapping them to accounting concepts via taxonomy tags. Terms must appear in at least 20 distinct filings. This approach captures variation in real-world reporting practice for U.S. domestic filers.
Key Characteristics:
- Source: ~50,000 U.S. 10-K XBRL filings
- Average synonyms per concept: 12.0
- Textual similarity: 0.86
- Concentration: 0.54
XBRL-Based Approach: International Filers
This list follows the same methodology as the 10-K bottom-up approach but focuses on non-U.S. firms cross-listed in the United States that file Form 20-F using the IFRS Taxonomy. This allows us to capture terminology variation in international reporting practice.
Key Characteristics:
- Source: 20-F XBRL filings (international filers using IFRS)
- Methodology: Same as 10-K bottom-up approach
- Frequency threshold: 20+ filings
- Disambiguation: 5% threshold per concept
Notes
Column Definitions:
- TID: Term ID (unique identifier for each term)
- nGram: Length of the term (number of words)
- Term: The actual accounting term or phrase
Usage: Use the column filters to search for specific terms. Click column headers to sort. The tables show a sample of entries; download the Excel file above for the complete dataset.
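The same searching and sorting can be done offline once the data is exported. The sketch below assumes the rows have been loaded as dictionaries keyed by the TID, nGram, and Term columns; the sample rows and the helper names `search_terms` and `ngram_mismatches` are hypothetical, and the word count uses simple whitespace splitting, which may differ from how nGram was computed.

```python
# Made-up sample rows using the documented column names.
rows = [
    {"TID": 1, "nGram": 1, "Term": "revenue"},
    {"TID": 2, "nGram": 2, "Term": "net sales"},
    {"TID": 3, "nGram": 4, "Term": "cost of goods sold"},
]

def search_terms(rows, keyword="", min_ngram=1):
    """Case-insensitive substring search on Term, sorted by nGram then Term."""
    keyword = keyword.lower()
    hits = [
        r for r in rows
        if keyword in r["Term"].lower() and r["nGram"] >= min_ngram
    ]
    return sorted(hits, key=lambda r: (r["nGram"], r["Term"]))

def ngram_mismatches(rows):
    """Return TIDs whose nGram differs from the whitespace word count of Term."""
    return [r["TID"] for r in rows if len(r["Term"].split()) != r["nGram"]]

sales_hits = search_terms(rows, "sales")
multiword = search_terms(rows, min_ngram=2)
```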